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Abstract. 

The graph of communities is a network emerging above the level of individual nodes in 
the hierarchical organisation of a complex system. In this graph the nodes correspond to 
communities (highly interconnected subgraphs, also called modules or clusters), and the links 
refer to members shared by two communities. Our analysis indicates that the development 
of this modular structure is driven by preferential attachment, in complete analogy with the 
growth of the underlying network of nodes. We study how the links between communities are 
born in a growing co-authorship network, and introduce a simple model for the dynamics of 
overlapping communities. 



Introduction. - A wide class of complex systems occurring from the level of cells to society 
can be described in terms of networks capturing the intricate web of connections among the 
units they are made of. Graphs corresponding to these real networks exhibit unexpected 
non-trivial properties, e.g., new kinds of degree distributions, anomalous diameter, spreading 
phenomena, clustering coefficient, and correlations [1-5]. In recent years, there has been 
a quickly growing interest in the structural sub-units of complex networks, associated with 
more highly interconnected parts [6-18]. These sets of nodes are usually called clusters, 
communities, cohesive groups, or modules, with no widely accepted, unique definition. Such 
building blocks (functionally related proteins [19,20], industrial sectors [21], groups of people 
[14,22], cooperative players [23,24], etc.) can play a crucial role in forming the structural and 
functional properties of the involved networks. On the other hand, the presence of communities 
in networks is a relevant and informative signature of the hierarchical nature of complex 
systems [19,25,26]. 

Typically, the communities in a complex system are not isolated from each other, instead, 
they have overlaps, e.g., a protein can be part of more than one functional unit [27], and people 
can be members in different social groups at the same time [28]. This observation naturally 
leads to the definition of the community graph: a network representing the connections between 
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the communities, with the nodes referring to communities and links corresponding to shared 
members between the communities. Accordingly, the community degree d com of a community 
is given by the number of other communities it overlaps with, and is equal to the degree of the 
corresponding node in the community graph. The studies of the relevant statistics describing 
the community graph (i.e., the degree distribution, clustering etc.) of real networks have 
just begun [29]. So far, in the networks investigated, the community degree distribution was 
shown to decay exponentially for low and as a power law for higher community degree values. 
This means that fat tailed degree distributions appear at two levels in the hierarchy of these 
systems: both at the level of nodes (the underlying networks are scale free), and at the level 
of the communities as well. 

Preferential attachment is a key concept in the field of scale-free networks. In a wide range 
of graph models the basic mechanism behind the emerging power law degree distribution is 
that the new nodes attach to the old ones with probability proportional to their degree [2-4] . 
Furthermore, in earlier works the occurrence of preferential attachment was directly demon- 
strated in several real world networks with scale free degree distribution [30-32] . The observed 
fat tails in the degree distribution of the community graphs indicate that the mechanism of 
preferential attachment could be present at the level of communities as well. Our aim in the 
present manuscript is to examine the attachment statistics of communities in order to clarify 
this question. Our investigations focus on the development of the communities in the growing 
co-authorship network of the Los Alamos cond-mat e-print archive [33], in which the nodes 
correspond to authors, and two authors become linked if they publish an article together. By 
studying the time evolution of this system,we investigate the dynamics of the new commu- 
nity links. For example, when a previously unlinked community is attached to another one, 
what are the size and community degree statistics of that other community? Another, closely 
related issue addressed in this paper is the appearance of new members in the communities. 
The size distribution of the communities was found to be a power-law in the system to be 
investigated [29]. Thus it is natural to address questions such as: What happens when a node 
belonging to none of the communities suddenly joins a community? What are the size and 
community degree statistics of the community chosen? 

The communities. - In the present work we study the dynamics of the communities in 
the Los Alamos cond-mat e-print archive [33], in which an article with n authors contributes 
with (n — to the weight of the links between every pairs of its authors. (The dataset 
contains altogether 30739 nodes and 136065 links). The communities are extracted with the 
Clique Percolation Method (CPM) [29, 34] at each time step, using the CFinder package 
freely downloadable from [35]. (Each time step corresponds to one month, and the data set 
contained 143 time steps from February 1992 to April 2004). The communities obtained 
by the CPM correspond to /c-clique percolation clusters in the network. The fc-cliques are 
complete subgraphs of size k (in which each node is connected to every other nodes). A 
/c-clique percolation cluster is a subgraph containing fc-cliques that can all reach each other 
through chains of /c-clique adjacency, where two /c-cliques are said to be adjacent if they share 
k — 1 nodes. The /c-clique percolation clusters can be best visualised with the help of fc-clique 
templates, that are objects isomorphic to a complete graph of k vertices. Such a template 
can be placed onto any fc-cliquc in the graph, and rolled to an adjacent /c-clique by relocating 
one of its vertices and keeping its other k — 1 vertices fixed. Thus, the fc-clique percolation 
clusters (fc-clique communities) of a graph are all those subgraphs that can be fully explored 
by rolling a fc-clique template in them but cannot be left by this template. 

The main advantages of this community definition are that it is not too restrictive, it is 
local, it is based on the density of the links and it allows overlaps between the communities: 
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a node can be part of several fc-clique percolation clusters at the same time. The number of 
communities a given node i belongs to shall be referred to as the membership number rrii of 
the node from now on. 

When applied to weighted networks (such as the present co-authorship network), the CPM 
method has two parameters: the A:-clique size k, and a weight threshold uj* (links weaker than 
oj* are ignored). The criterion used to fix these parameters is based on finding a community 
structure as highly structured as possible. In the present paper we stick to the optimal 
parameter values found in earlier studies of the same co-authorship network [29], given by 
k = 6 and u* = 0.1. 

Determining attachment probabilities. - The method presented below can be applied in 
general to any empirical study of an attachment process where the main goal is to decide 
whether the attachment is uniform or preferential with respect to a certain property {e.g., 
degree, size, etc.) of the attached objects {e.g., nodes, communities etc.). If the process is 
uniform with respect to a property p, then objects with a given p are chosen at a rate given 
by the distribution of p amongst the available objects. However, if the attachment mechanism 
prefers high (or low) p values, then objects with high (or low) p are chosen with a higher 
rate compared to the p distribution of the available objects. To monitor this enhancement, 
one can construct the cumulative p distribution Pt{p) of the available objects at each time 
step t, together with the un-normalised cumulative p distribution of the objects chosen by the 
process between t and t + 1, denoted by wt->t+i{p). The value of wt^t+i{p*) at a given p* 
equals to the number of objects chosen in the process between t and t+1, that had a p value 
larger than p* at t. To detect deviations from uniform attachment, it is best to accumulate 
the ratio of Wt-,t+i{p) and Pt{p) during the time evolution to obtain 



If the attachment is uniform with respect to p, then W{p) becomes a flat function. However, if 
W{p) is an increasing function, then the objects with large p are favoured, if it is a decreasing 
function, the objects with small p are favoured in the attachment process. The advantage of 
this approach is that the rate-like variable Wt^t+i{p) associated to the time step between t 
and t+ 1 is always compared to the Pt{p) distribution at t. Therefore W{p) is able to indicate 
preference (or the absence of preference) even when Pt{p) is slowly changing in time (as in 
the case of the community degree in the co-authorship network under investigation). 

We have tested the above method on simulated graphs grown with known attachment 
mechanisms, i) uniform attachment (new nodes are attached to a randomly selected old node), 
ii) linear preferential attachment (new nodes are attached to old ones with a probability 
proportional to the degree), Hi) and anti-preferential attachment (new nodes are attached to 
the old ones with a probability proportional to exp(— d), where d is the degree). In these 
cases the degree d of the individual nodes plays the role of the parameter p. For each time 
step, we recorded the cumulative degree distribution of the nodes Pt{d), together with the 
number of nodes gaining new links with a degree higher than a given d, labelled by Wt^t+i{d). 
By summing the ratio of these two functions along the time evolution of the system one gets 
W(d) = X)t=cT _1 w t->t+i{d)/Pt{d). In figCJi. we show the empirical results for W{d) obtained 
for the simulated networks grown with the three different attachment rules. The curves reflect 
the difference between the three cases very well: for the uniform attachment probability W{d) 
is flat, for the preferential attachment W{d) is clearly increasing, and for the anti-preferential 
attachment W{d) is decreasing. We have also calculated the attachment statistics of the nodes 
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Fig. 1 - a) The W(d) function for networks grown with known attachment rules: uniform probability 
(squares), linear preferential attachment (open circles), and anti preferential attachment (diamonds), 
b) The W(d) function in the co-authorship network of the Los Alamos cond-mat archive. 

in the studied co-authorship network. As it can be seen in fig EI)., the corresponding W(d) 
curve is increasing, therefore preferential attachment is present at the level of nodes in the 
system. 

Results. - In case of the communities of the investigated co-authorship network, the two 
properties to be substituted in place of p are the community degree d com and the commu- 
nity size s, therefore, the cumulative community size distribution Pt(s) and the cumulative 
community degree distribution P t (cZ com ) were recorded at each time step t. To study the 
establishment of the new community links, we constructed the un-normalised cumulative size 
distribution wt^t+i{s) and the un-normalised cumulative degree distribution wt—tt+i (d com ) 
of the communities gaining new community links to previously unlinked communities. The 
value of these distributions at a given s (or given d com ) is equal to the number of unlinked 
communities at t that establish a community link between t and t+ 1 with a community larger 
than s (or having larger degree than d com ) at t. By accumulating the ratio of the rate-like 
variables and the corresponding distributions we obtain 



W(s)= J2 



4=0 



w t ->t+i(s) 
Pt(s) 



W(d com ) = J2 



4=0 



w t ^ t+1 (d com ) 
P t (d com ) 



(2) 



For the investigation of the appearance of new members in the communities, we recorded 
the un-normalised community size distribution w t -,t+i(s) and the un-normalised community 
degree distribution w t ^t+i(d com ) of the communities gaining new members (belonging previ- 
ously to none of the communities) between t and t + 1. The corresponding distributions that 
can be used to detect deviations from the uniform attachment are 



W(s)= J2 



w t ->t+i(s) 
Pt(s) 



W(d com )= J2 



4=0 



w t ^ t+1 (d com ) 
P t (d com ) 



(3) 



In fig|2^- we show the empirical W(s) and W(s) functions, whereas in fig[2l3. the empirical 
W(d com ) and W{d com ) are displayed. All four functions are clearly increasing, therefore we 
can draw the following important conclusions: 
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• When a previously unlinked community establishes a new community link, communities 
with large size and large degree are selected with enhanced probability from the available 
other communities. 

• When a node previously belonging to none of the communities joins a community, com- 
munities with large size and large degree are selected with enhanced probability from 
the available communities. 




Fig. 2 - a) The W(s) and W(s) functions for the communities of the co-authorship network of the 
Los Alamos cond-mat e-print archive, b) The W(d coln ) and W(d corn ) functions of the same network. 
The increasing nature of these functions indicates preferential attachment at the level of communities 
in the system. 



We note that the community size and the community degree are strongly for higher sizes 
and degrees: large communities have large community degree and vice versa. Therefore, if one 
observes an attachment mechanism that is preferential with respect to either the community 
size, or the community degree, than it must be preferential for both of them. 

A toy model. - In this section we outline a simple model for the growth of overlapping 
communities, in which the preferential attachment of the node to communities results in 
the emergence of a community system with scaling community size and community degree 
distribution. We note that when using the well known models based on preferential attachment 
solely between the nodes [2-4], the resulting graph contains no communities at all at k — 6. 

In our model the underlying network between the nodes is left unspecified, the focus is on 
the content of the communities. During the time evolution, similarly to the models published 
in [36-38], new members may join the already existing communities, and new communities may 
emerge as well. The new nodes introduced to the system choose their community preferentially 
with the community size, therefore the size distribution of the communities is expected to 
develop into a power-law. The appearance of the new community links originates in new 
nodes joining several communities at the same time. The detailed rules of the model are the 
following: 

• The initial state of the model is a small set of communities with random size. 

• The new nodes are added to the system separately. For each new node i, a membership 
rrn is drawn from a Poissonean distribution with an expectation value of A. 
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• If rrii > 1, communities are succeedingly chosen with probabilities proportional to their 
sizes, until is reached, and the node i joins the chosen communities simultaneously. 

• If rrii = 0, the node i joins the group of unclassified vertices. 

• When the ratio r of the group of unclassified nodes compared to the total number of 
nodes N exceeds a certain limit r* , a number of q vertices from the group establish a 
new community. (Obviously, q must be smaller than Nr even in the inital state). 

To be able to compare the results of the model with the community structure of the co- 
authorship network, the runs were stopped when the number of nodes in the model reached 
the size of the co-authorship network. 

Our experience showed that the model is quite insensitive to changes in r or q, and A is 
the only important parameter. For small values (A < 0.3) the resulting community degree 
distribution is truncated, whereas when A is too large (A > 1), a giant community with 
abnormally large community degree appears. For intermediate A values (0.3 < A < 1), the 
community size- and community degree distributions become fat tailed, similarly to the co- 
authorship network. In figEI we show the cumulative community size distribution P(s) and 




Fig. 3 - a) the cumulative community size distribution P(s) (open circles) in our model at A = 0.6 
follows a power- law with an exponent of —1.4 (straight line) ( b) the cumulative community degree 
distribution P(d com ) (filled circles) in our model at the same A. The tail of this distribution follows 
the same power- law as the community size distribution (straight line), similarly to the communities 
found in the co-authorship network [29]. 

the cumulative community degree distribution P(d com ) of the communities obtained in our 
model at A = 0.6. (Changes in the parameters r and q only shifts these curves, their shape 
remains unchanged). Our model grasps the relevant statistical properties of the community 
structure in the co-authorship network [29] quite well: the community size distribution and 
the tail of the community degree distribution follow a power-law with the same exponent. 

Conclusions. We studied the evolution of the community graph in a growing co- 
authorship network. We found that similar processes control the growth of the system at 
different levels in the hierarchy, as the growth of the communities, the development of the 
community graph and the growth of the underlying network are all driven by preferential 
attachment. Inspired by these results, we introduced a simple model for the dynamics of 
overlapping communities leading to scaling size- and community degree distribution. 
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